Add MMseqs2 clustering and taxonomy #6574

hugolefeuvre · 2024-11-19T14:28:13Z

FOR CONTRIBUTOR:

I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
License permits unrestricted use (educational + commercial)
This PR adds a new tool or tool collection
This PR updates an existing tool or tool collection
This PR does something else (explain below)

bernt-matthias · 2024-11-20T11:14:52Z

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

+ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
+mmseqs createtaxdb
+    '$createtaxdb.database_type.mmseqs2_db_select.fields.path'


Can you try:

Suggested change

-ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
-mmseqs createtaxdb
-    '$createtaxdb.database_type.mmseqs2_db_select.fields.path'
+ls -lah '$createtaxdb.database_type.mmseqs2_db_select.fields.path'* &&
+ln -s '$createtaxdb.database_type.mmseqs2_db_select.fields.path' taxdb &&
+mmseqs createtaxdb
+    'taxdb'

If needed add an extension to taxdb.

It seems that in the end this script is executed, so I guess the trick should work.

Also wondering if we should download all the databases with every job or if we should create (or reuse existing) reference data? At least the NCBI taxonomy dump files should already be handled by another data manager. Not sure about the mappings to UniProt.

Of course the option to allow users to provide their own mapping needs to be preserved.

Wondering if createdb and createtaxdb should be separate tools / data managers.

hugolefeuvre · 2024-11-20

> Also wondering if we should download all the databases with every job or if we should create (or reuse existing) reference data? At least the NCBI taxonomy dump files should already be handled by another data manager. Not sure about the mappings to UniProt.

I don't know enough about this to be able to answer your question.

> Wondering if createdb and createtaxdb should be separate tools / data managers.

We already considered it, but we thought it would be more useful to be able to perform an end-to-end taxonomy analysis (FASTA file to contig taxonomy) with a single Galaxy module. We can discuss it further if you think separate tools might be useful for other purposes.

bernt-matthias · 2024-11-20

I guess the main question is whether the output of createdb and createtaxdb can be used by multiple tools or reused later.

There are also disadvantages wrt reproducibility when the current DB is used. But of course it also has disadvantages.

My intuition tells me to split it, also because jobs that mainly download data need to be handled differently on compute systems. This is not possible if compute and download tasks are mixed in a tool.

But this is only my intuition and I leave this to you.

hugolefeuvre · 2024-11-20

In my opinion, one of the major problems with separating createdb and createtaxdb is the output of these two modules, for several reasons:

The output format is not supported by Galaxy.

createdb creates several files that must be used together (in a collection at most).

createtaxdb creates a file that must be combined with the other files before the DB can be used as a taxonomy DB. As input it can take either the files generated by createdb or the files of a DB that has no taxonomy.

In other words, createtaxdb might not be necessary in the current XML, because the databases I give it as input at the moment are only the taxonomy DBs provided by MMseqs2. However, other databases in the data tables of the MMseqs2 data manager have no taxonomy, and createtaxdb could be useful for them (https://github.com/soedinglab/mmseqs2/wiki#downloading-databases).

> This is not possible if compute and download tasks are mixed in a tool

Could this be why I'm having a problem in my test?
Finally, I don't really know what to do. On the one hand, I think it's a good idea to split it up, as this allows more flexibility for the user, but on the other hand, it seems more complicated to set up and less intuitive for the user.

bernt-matthias · 2024-11-20

> Could this be why I'm having a problem in my test?

Unlikely. The problem seems to be just that the tool tries to write next to the input files (the databases), which is not possible. If the symlink trick does not work, the only option seems to be to copy the database to the working directory.

hugolefeuvre · 2024-11-21

I'll try later by putting the database in a directory and symlinking that directory, because the symlink as created only covered the Swiss-Prot file of the database, but there are others (Swiss-Prot_h, Swiss-Prot_index, ...).
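An alternative to symlinking a whole directory is to link every file that shares the database prefix. The following is a minimal sketch, not the code used in this PR: the helper name `link_db_files`, the example paths, and the alias `taxdb` are hypothetical, and it assumes that all files of an MMseqs2 database share one path prefix (Swiss-Prot, Swiss-Prot_h, Swiss-Prot_index, ...).

```shell
# Hypothetical helper: symlink every file sharing the database prefix into a
# writable directory under a new prefix, so that files written "next to" the
# database land in the job directory and not beside the reference data.
# The body runs in a subshell, so its variables do not leak to the caller.
link_db_files() (
    db_path=$1     # absolute database prefix, e.g. /data/mmseqs2/Swiss-Prot
    target_dir=$2  # writable directory, e.g. the job working directory
    alias_name=$3  # new prefix inside target_dir, e.g. taxdb

    mkdir -p "$target_dir"
    for src in "$db_path"*; do
        [ -e "$src" ] || continue     # nothing matched the prefix
        suffix=${src#"$db_path"}      # "", "_h", "_index", ...
        ln -s "$src" "$target_dir/$alias_name$suffix"
    done
)

# Usage (assumed paths; the mmseqs call is left commented out):
# link_db_files /data/mmseqs2/Swiss-Prot "$PWD" taxdb
# mmseqs createtaxdb taxdb tmp
```

Because the links point at the original files, the database prefix should be an absolute path. Anything the tool then creates under the `taxdb` prefix is a new file in the working directory.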

bernt-matthias · 2024-11-21

The directory datatype might be a solution for the datatype questions.

> Finally, I don't really know what to do.

Follow your intuition. Your arguments are perfectly reasonable. And we can still improve this later.

bernt-matthias

Did a first quick review of the tools. Thanks already for the massive amount of work.

tools/mmseqs2/mmseqs2_taxonomy_assignment.xml

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

hugolefeuvre · 2024-11-22T12:11:09Z

It seems that the tests are currently failing because of problems with MMseqs2, in particular when downloading the taxonomy part for UniProt (the SILVA test runs correctly). I'll wait until the problem is solved. If it takes too long, I'll use other test databases.

hugolefeuvre · 2024-11-29T10:27:32Z

Hi @bernt-matthias, does the PR require further, more in-depth reviews (I guess so), perhaps by other reviewers?

bernt-matthias

Had another look, mainly on the data manager.

Is there any relation between the nt and the tax-nt databases? For instance are there corresponding entries?

data_managers/data_manager_mmseqs2_database/data_manager/data_manager_mmseqs2_download.xml

data_managers/data_manager_mmseqs2_database/data_manager_conf.xml

data_managers/data_manager_mmseqs2_database/tool_data_table_conf.xml.sample

tools/mmseqs2/macro.xml

tools/mmseqs2/mmseqs2_easy_linclust_clustering.xml

hugolefeuvre · 2024-12-02T15:34:49Z

> Is there any relation between the nt and the tax-nt databases? For instance are there corresponding entries?

What do you mean by relation? Do you mean whether the databases in mmseqs2_nucleotide_databases and mmseqs2_nucleotide_taxonomy_databases are the same, for example SILVA, which is in both data tables?
If so, there is no difference between them. Databases that are in the nt data table but not in the tax-nt data table have no taxonomic information and need createtaxdb if you want to use them as taxonomy databases.
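To illustrate the distinction, a wrapper could check whether a database already carries taxonomy information before deciding to run createtaxdb. This is only a sketch under assumptions: `has_taxonomy` is a hypothetical helper, the paths are made up, and it assumes that a taxonomy-enabled database has a `<db>_taxonomy` file next to it (like the Swiss-Prot_taxonomy file mentioned in this PR). Other MMseqs2 versions or databases may lay out their taxonomy files differently.

```shell
# Hypothetical helper: succeed if the database prefix has a taxonomy file.
# Assumption: taxonomy-enabled databases ship a "<db>_taxonomy" file.
has_taxonomy() {
    [ -e "${1}_taxonomy" ]
}

# Usage (assumed path):
# if has_taxonomy /data/mmseqs2/SILVA; then
#     echo "usable as a taxonomy database"
# else
#     echo "run createtaxdb first"
# fi
```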

bernt-matthias · 2024-12-03T11:55:53Z

One suggestion: if we can agree on a set of data tables and their columns, we can also split the PR into two parts: one for the tool and one for the data manager. To me it seems that the data manager will need a while and will slow down the progress of the tool.

hugolefeuvre · 2024-12-03T13:41:49Z

> One suggestion: if we can agree on a set of data tables and their columns, we can also split the PR into two parts: one for the tool and one for the data manager. To me it seems that the data manager will need a while and will slow down the progress of the tool.

No problem if you think that's the best way to do it. I'm not an expert and I don't know much about the data tables already present in Galaxy, which ones we can use, or how to use them, but I'm curious to learn.
Questions:

Do I still need to create an issue on the MMseqs2 repo to ask about DB versions?
How can we agree on a set of data tables?
The tool needs a data manager to function, so won't we have to wait until the DM is finished before the tool works? (I'm just wondering why we would split the PR.)

clsiguret and others added 30 commits October 15, 2024 11:04

Init mmseqs2

6467853

init DM

77d1bef

continue DM

3ad2ea4

Split to TOOL_VERSION and COMMIT

b9da697

Modify macros and json output

0ac3019

update macro

75f154d

init mmseqs2_taxonomy

fc600b6

init mmseqs2_createtaxdb

403a2d9

Change name and description

6c308bb

init mmseqs2_createdb

afd1577

init mmseqs2_createtsv

ec54759

init mmseqs2_createtsv

d48ab99

Merge branch 'mmseqs2' of github.com:clsiguret/tools-iuc into mmseqs2

71779aa

continue DM

5f584df

init taxonomyreport

0c250c5

add test files for createtsv

f96947c

Add second test with other data table

09771c7

add double quote

7f2c49e

start create_db

88da237

continue mmseqs2 DM (macros modification)

157360f

put all xml into one

410d4f4

put all xml into one

3c3d188

Merge branch 'galaxyproject:main' into mmseqs2

8fbf055

add createdb section

b82acf0

add createtaxdb and filtertaxseqdb sections

0f859b5

Update taxonomy assignement : taxonomy module prefilter options

0bd7ef7

taxonomy part : align parameters

e1299a9

taxonomy module : misc and common options

76da478

all parameters into xml

d0fac9f

finish wrapping command and start tests

279f796

hugolefeuvre and others added 9 commits November 15, 2024 09:39

start modify DM

98e72f3

modify DM path and json informations

eb4187d

wrong value

1ba4a1f

start multiple datatable management

f46798e

add nucleotide data table into param

65c07c3

Reduced database, possibility of having the 2 types of report

3efd6ab

Merge branch 'galaxyproject:main' into mmseqs2

78ba0c6

delete useless files and parameters + last tests

220cd5a

try to chmod Swiss-Prot_taxonomy because error could not open for wri…

c729ba6

…ting

bernt-matthias reviewed Nov 20, 2024

hugolefeuvre and others added 7 commits November 20, 2024 14:35

Create a symlink of the database to the job working directory

072fc28

Ensure that the _taxonomy file is not written to the test data section. Co-authored-by: M Bernt

try symlink with a directory

54a432f

try with database cp and modify DM

fe0b983

remove filtertaxseqdb conditionnal

3637da0

few changes

6d07b87

macros parameters

940613d

Revert "macros parameters"

09e0d94

This reverts commit 940613d. Back before macros when tests passed

hugolefeuvre added 2 commits November 25, 2024 09:54

Revert "Revert "macros parameters""

41cb56d

This reverts commit 09e0d94. Back to macros parameters since tool issue has been resolved

modify alph_type conditionnal, remove createdb-mode parameter and TWI…

2e2364d

…N parameters (automatically set by tool)

bernt-matthias reviewed Nov 29, 2024

hugolefeuvre added 2 commits December 2, 2024 11:53

include the commit in the tool version, add .lint_skip file to skip T…

816451e

…oolVersionPEP404 error

few modifications on DM and tools wrapper

dde715a
